Practical Data-Dependent Metric Compression with Provable Guarantees

Authors

  • Piotr Indyk
  • Ilya P. Razenshteyn
  • Tal Wagner
Abstract

How well can one compress a dataset of points from a high-dimensional space while preserving pairwise distances? Indyk and Wagner recently obtained almost optimal bounds for this problem, but their construction, which is based on hierarchical clustering, is not practical. In this talk, I will present a new, practical, quadtree-based compression scheme whose provable performance essentially matches that of the Indyk and Wagner result. In addition to the theoretical results, we will see an experimental comparison of the new scheme with Product Quantization (PQ), one of the most popular heuristics for distance-preserving compression, on several datasets. Unlike PQ and other heuristics, which rely on the clusterability of the dataset, the new algorithm turns out to be more robust. The talk is based on joint work with Piotr Indyk and Tal Wagner.

Organizer(s): Rutgers/DIMACS Theory of Computing
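To make the notion of distance-preserving compression concrete, here is a minimal Python sketch. It is neither the quadtree scheme from the talk nor PQ. It is only an illustrative assumption: every coordinate is snapped to a uniform grid (loosely, the cells of a single quadtree level) and stored as a small integer. The names `quantize`, `dequantize` and `max_distance_error` are hypothetical. Each coordinate moves by at most `cell / 2`, so each point moves by at most `(cell / 2) * sqrt(d)`, and by the triangle inequality every pairwise Euclidean distance changes by at most `cell * sqrt(d)`.

```python
import math
import random


def quantize(points, cell):
    """Snap each coordinate to the nearest multiple of `cell`; keep integer codes."""
    return [[round(x / cell) for x in p] for p in points]


def dequantize(codes, cell):
    """Map integer codes back to approximate coordinates."""
    return [[c * cell for c in q] for q in codes]


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_distance_error(points, approx):
    """Largest absolute change in any pairwise distance after compression."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            err = abs(dist(points[i], points[j]) - dist(approx[i], approx[j]))
            worst = max(worst, err)
    return worst


random.seed(0)
d, n, cell = 8, 50, 0.05  # illustrative values, not taken from the talk
points = [[random.random() for _ in range(d)] for _ in range(n)]

codes = quantize(points, cell)
approx = dequantize(codes, cell)

bound = cell * math.sqrt(d)  # additive guarantee derived in the lead-in
worst = max_distance_error(points, approx)
print(f"worst pairwise error {worst:.4f}, guaranteed bound {bound:.4f}")
```

Shrinking `cell` tightens the additive error but needs more bits per coordinate. That trade-off between space and distortion is the one the abstract refers to.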

Similar resources

A Practical Algorithm for Topic Modeling with Provable Guarantees

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but th...

Cluster-Aware Compression with Provable K-means Preservation

This work rigorously explores the design of cluster-preserving compression schemes for high-dimensional data. We focus on the K-means algorithm and identify conditions under which running the algorithm on the compressed data yields the same clustering outcome as on the original. The compression is performed using single- and multi-bit minimum mean square error quantization schemes as well as a gi...

Error-Resilient Optimal Data Compression

The problem of communication and computation in the presence of errors is difficult, and general solutions can be time consuming and inflexible (particularly when implemented with a prescribed error detection/correction). A reasonable approach is to investigate reliable communication in carefully selected areas of fundamental interest where specific solutions may be more practical than general ...

On Computing Compression Trees for Data Collection in Sensor Networks

We address the problem of efficiently gathering correlated data from a wired or a wireless sensor network, with the aim of designing algorithms with provable optimality guarantees, and understanding how close we can get to the known theoretical lower bounds. Our proposed approach is based on finding an optimal or a near-optimal compression tree for a given sensor network: a compression tree is ...

Workload-Optimal Histograms on Streams

Histograms are used in many ways in conventional databases and in data stream processing for summarizing massive data distributions. Previous work on constructing histograms on data streams with provable guarantees has not taken into account the workload characteristics of databases, which show some parts of the distributions to be more frequently used than others; on the other hand, previo...

Publication date: 2017